Abstract

For many types of learners one can compute the statistically "optimal" way to select data. We review how 
these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then 
show how the same principles may be used to select data for two alternative, statistically-based learning 
architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural 
networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted 
regression are both efficient and accurate. 
1 ACTIVE LEARNING -
BACKGROUND 

An active learning problem is one where the learner has 
the ability or need to influence or select its own training 
data. Many problems of great practical interest allow 
active learning, and many even require it. 

We consider the problem of actively learning a mapping
$X \to Y$ based on a set of training examples
$\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X$ and $y_i \in Y$.
The learner is allowed to iteratively select new inputs x
(possibly from a constrained set), observe the resulting
output y, and incorporate the new examples (x, y) into its
training set.

The primary question of active learning is how to 
choose which x to try next. There are many heuristics for 
choosing x based on intuition, including choosing places 
where we don't have data [Whitehead, 1991], where we 
perform poorly [Linden and Weber, 1993], where we have 
low confidence [Thrun and Moller, 1992], where we expect
it to change our model [Cohn et al., 1990], and
where we previously found data that resulted in learning 
[Schmidhuber and Storck, 1993]. 

In this paper we consider how one may select x "op- 
timally" from a statistical viewpoint. We first review 
how the statistical approach can be applied to neu- 
ral networks, as described in MacKay [1992] and Cohn 
[1994]. We then consider two alternative, statistically- 
based learning architectures: mixtures of Gaussians and 
locally weighted regression. While optimal data selec- 
tion for a neural network is computationally expensive 
and approximate, we find that optimal data selection for 
the two statistical models is efficient and accurate. 

2 ACTIVE LEARNING - A 
STATISTICAL APPROACH 

We denote the learner's output given input $x$ as $\hat{y}(x)$.
The mean squared error of this output can be expressed
as the sum of the learner's squared bias and its variance.
The variance $\sigma^2_{\hat{y}}(x)$ indicates the learner's
uncertainty in its estimate at $x$.^1 Our goal will be to
select a new example $\tilde{x}$ such that when the resulting
example $(\tilde{x}, \tilde{y})$ is added to the training set,
the integrated variance IV is minimized:

$$\mathrm{IV} = \int \sigma^2_{\hat{y}}(x) \, P(x) \, dx.$$

Here, $P(x)$ is the (known) distribution over $X$. In practice,
we will compute a Monte Carlo approximation of this integral,
evaluating $\sigma^2_{\hat{y}}$ at a number of random points
drawn according to $P(x)$.
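
The Monte Carlo approximation of the integrated variance can be sketched as follows. This is only an illustration under stated assumptions: the names `integrated_variance`, `variance_fn`, and `sample_inputs` are hypothetical, and the toy variance function $x^2$ with a uniform $P(x)$ is chosen purely because its integral (1/3) is known.

```python
import numpy as np

def integrated_variance(variance_fn, sample_inputs, n_ref=1000, rng=None):
    """Monte Carlo estimate of the integral of variance_fn(x) * P(x) dx.

    variance_fn   : callable mapping an array of inputs to predictive variances
    sample_inputs : callable (n, rng) -> n reference points drawn from P(x)
    """
    rng = np.random.default_rng() if rng is None else rng
    x_ref = sample_inputs(n_ref, rng)          # reference points x ~ P(x)
    return float(np.mean(variance_fn(x_ref)))  # sample average approximates the integral

# Toy stand-ins (assumptions): P(x) uniform on [0, 1], variance function x**2.
def uniform_inputs(n, rng):
    return rng.uniform(0.0, 1.0, size=n)

iv = integrated_variance(lambda x: x**2, uniform_inputs,
                         n_ref=200_000, rng=np.random.default_rng(0))
```

In an active learner, `variance_fn` would be the model's own variance estimate, and the same reference points would typically be reused when comparing candidate queries so that the comparison is not dominated by sampling noise.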

Selecting the new input $\tilde{x}$ so as to minimize IV requires
computing $\tilde{\sigma}^2_{\hat{y}}$, the new variance at $x$
given $(\tilde{x}, \tilde{y})$. Until we actually commit to an
$\tilde{x}$, we do not know what corresponding $\tilde{y}$ we will
see, so the minimization cannot be performed deterministically.^2
Many learning architectures, however, provide an estimate of
$P(\tilde{y}|\tilde{x})$ based on current data, so we can use this
estimate to compute the expectation of $\tilde{\sigma}^2_{\hat{y}}$.
Selecting $\tilde{x}$ to minimize the expected integrated variance
provides a solid statistical basis for choosing new examples.

1 Unless explicitly denoted, $\hat{y}$ and $\sigma^2_{\hat{y}}$ are
functions of $x$. For simplicity, we present our results in the
univariate setting. All results in the paper extend easily to the
multivariate case.

2 This contrasts with related work by Plutowski and White [1993],
which is concerned with filtering an existing data set.

2.1 EXAMPLE: ACTIVE LEARNING WITH 
A NEURAL NETWORK 

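In this section we review the use of techniques from
Optimal Experiment Design (OED) to minimize the estimated
variance of a neural network [Fedorov, 1972; MacKay, 1992;
Cohn, 1994]. We will assume we have been given a learner
$\hat{y} = f_{\hat{w}}()$, a training set $\{(x_i, y_i)\}_{i=1}^m$
and a parameter vector $\hat{w}$ that maximizes a likelihood
measure. One such measure is the minimum sum squared residual

$$S^2 = \frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}(x_i) \right)^2.$$

The estimated output variance of the network is

$$\sigma^2_{\hat{y}} \approx S^2 \left( \frac{\partial \hat{y}(x)}{\partial w} \right)^T \left( \frac{\partial^2 S^2}{\partial w^2} \right)^{-1} \left( \frac{\partial \hat{y}(x)}{\partial w} \right).$$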
The standard OED approach assumes normality and
local linearity. These assumptions allow replacing the
distribution $P(\tilde{y}|\tilde{x})$ by its estimated mean
$\hat{y}(\tilde{x})$ and variance $S^2$. The expected value of
the new variance, $\tilde{\sigma}^2_{\hat{y}}$, is then:

$$\left\langle \tilde{\sigma}^2_{\hat{y}} \right\rangle \approx \sigma^2_{\hat{y}} - \frac{\sigma^2_{\hat{y}}(x, \tilde{x})}{S^2 + \sigma^2_{\hat{y}}(\tilde{x})} \qquad \text{[MacKay, 1992]}, \qquad (2)$$

where we define

$$\sigma_{\hat{y}}(x, \tilde{x}) \equiv S^2 \left( \frac{\partial \hat{y}(x)}{\partial w} \right)^T \left( \frac{\partial^2 S^2}{\partial w^2} \right)^{-1} \left( \frac{\partial \hat{y}(\tilde{x})}{\partial w} \right).$$

For empirical results on the predictive power of Equation 2,
see Cohn [1994].
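
The structure of this criterion (old variance minus the squared covariance between the outputs at the reference point and the query, divided by the noise variance plus the variance at the query) can be checked on a model where the linearity assumption holds exactly. The sketch below is not a neural network: it assumes a hypothetical quadratic feature map `features`, a known noise variance `s2`, and ordinary least squares, and compares the predicted post-query variance with a direct recomputation.

```python
import numpy as np

def features(x):
    """Hypothetical feature map (an assumption of this sketch): 1, x, x^2."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.stack([np.ones_like(x), x, x**2], axis=1)

def output_cov(A_inv, s2, xa, xb):
    """s2 * phi(xa)^T A^{-1} phi(xb) for every pair (xa[i], xb[j])."""
    return s2 * features(xa) @ A_inv @ features(xb).T

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=20)
s2 = 0.1                                   # assumed known noise variance
Phi = features(x_train)
A_inv = np.linalg.inv(Phi.T @ Phi)

x_ref = np.linspace(0.0, 1.0, 50)          # reference points where variance is evaluated
x_new = 0.9                                # candidate query

var_old = np.diag(output_cov(A_inv, s2, x_ref, x_ref))
cov_ref_new = output_cov(A_inv, s2, x_ref, x_new)[:, 0]
var_at_new = output_cov(A_inv, s2, x_new, x_new)[0, 0]

# Predicted variance after the query: old variance minus squared covariance
# over (noise variance + variance at the query).
var_pred = var_old - cov_ref_new**2 / (s2 + var_at_new)

# Direct recomputation after actually adding the query point to the design.
Phi_new = np.vstack([Phi, features(x_new)])
A_inv_new = np.linalg.inv(Phi_new.T @ Phi_new)
var_direct = np.diag(output_cov(A_inv_new, s2, x_ref, x_ref))
```

For a linear model the two agree to numerical precision (this is the Sherman-Morrison rank-one update), and the result does not depend on the observed output at the query. For a neural network the same expression is only an approximation, which is the limitation discussed in the text.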

The advantages of minimizing this criterion are that 
it is grounded in statistics, and is optimal given the as- 
sumptions. Furthermore, the criterion is continuous and 
differentiable. As such, it is applicable in continuous 
domains with continuous action spaces, and allows hill- 
climbing to find the "best" x. 

For neural networks, however, this approach has many
disadvantages. The criterion relies on simplifications
and strong assumptions which hold only approximately.
Computing the variance estimate requires inversion of a
$|w| \times |w|$ matrix for each new example, and
incorporating new examples into the network requires expensive
retraining. Paass and Kindermann [1995] discuss an
approach which addresses some of these problems.

3 MIXTURES OF GAUSSIANS 

The mixture of Gaussians model is gaining popularity
among machine learning practitioners [Nowlan, 1991;
Specht, 1991; Ghahramani and Jordan, 1994]. It assumes
that the data is produced by a mixture of $N$ Gaussians
$g_i$, for $i = 1, ..., N$. We can use the EM algorithm
[Dempster et al., 1977] to find the best fit to the data,
after which the conditional expectations of the mixture
can be used for function approximation.

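For each Gaussian $g_i$ we will denote the estimated
input/output means as $\mu_{x,i}$ and $\mu_{y,i}$ and estimated
covariances as $\sigma^2_{x,i}$, $\sigma^2_{y,i}$ and
$\sigma_{xy,i}$. The conditional variance of $y$ given $x$ may
then be written

$$\sigma^2_{y|x,i} = \sigma^2_{y,i} - \frac{\sigma^2_{xy,i}}{\sigma^2_{x,i}}.$$

We will denote as $n_i$ the (possibly fractional) number
of training examples for which $g_i$ takes responsibility:

$$n_i = \sum_{j=1}^m \frac{P(x_j, y_j | i)}{\sum_{k=1}^N P(x_j, y_j | k)}.$$

For an input $x$, each $g_i$ has conditional expectation
$\hat{y}_i$ and variance $\sigma^2_{\hat{y},i}$:

$$\hat{y}_i = \mu_{y,i} + \frac{\sigma_{xy,i}}{\sigma^2_{x,i}} (x - \mu_{x,i}), \qquad \sigma^2_{\hat{y},i} = \frac{\sigma^2_{y|x,i}}{n_i} \left( 1 + \frac{(x - \mu_{x,i})^2}{\sigma^2_{x,i}} \right).$$

These expectations and variances are mixed according
to the prior probability that $g_i$ has of being responsible
for $x$:

$$h_i \equiv h_i(x) = \frac{P(x|i)}{\sum_{j=1}^N P(x|j)}.$$

For input $x$ then, the conditional expectation $\hat{y}$ of the
resulting mixture and its variance may be written:

$$\hat{y} = \sum_{i=1}^N h_i \hat{y}_i, \qquad \sigma^2_{\hat{y}} = \sum_{i=1}^N \frac{h_i^2 \sigma^2_{y|x,i}}{n_i} \left( 1 + \frac{(x - \mu_{x,i})^2}{\sigma^2_{x,i}} \right).$$

In contrast to the variance estimate computed for a neural
network, here $\sigma^2_{\hat{y}}$ can be computed efficiently
with no approximations.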
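
A minimal sketch of this computation for scalar inputs and outputs is given below. It assumes the mixture has already been fitted (for example by EM) and that its per-component statistics are supplied as arrays. The function name `mixture_predict` and its argument names are hypothetical, and the per-component variance is taken to be the usual linear-regression form (conditional variance over $n_i$, inflated by the squared distance from the component's input mean), which is an assumption of the sketch.

```python
import numpy as np

def mixture_predict(x, mu_x, mu_y, var_x, var_y, cov_xy, n_resp):
    """Conditional mean and variance of y at scalar x.

    Every parameter is a length-N array with one entry per Gaussian;
    n_resp[i] is the (possibly fractional) number of examples owned by g_i.
    """
    # h_i(x): input density of each component, normalized over components
    px = np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2.0 * np.pi * var_x)
    h = px / px.sum()
    # Per-component regression line and conditional variance of y given x
    y_i = mu_y + cov_xy / var_x * (x - mu_x)
    var_y_given_x = var_y - cov_xy ** 2 / var_x
    # Variance of each component's prediction (assumed linear-regression form)
    var_i = var_y_given_x / n_resp * (1.0 + (x - mu_x) ** 2 / var_x)
    return float(h @ y_i), float(h ** 2 @ var_i)

# Single-component example: the mixture reduces to one regression line.
y_hat, var_hat = mixture_predict(
    2.0,
    mu_x=np.array([0.0]), mu_y=np.array([1.0]),
    var_x=np.array([1.0]), var_y=np.array([2.0]),
    cov_xy=np.array([1.0]), n_resp=np.array([10.0]),
)

# Two symmetric components evaluated midway between them.
y_mid, var_mid = mixture_predict(
    0.0,
    mu_x=np.array([-1.0, 1.0]), mu_y=np.array([0.0, 2.0]),
    var_x=np.array([1.0, 1.0]), var_y=np.array([1.0, 1.0]),
    cov_xy=np.array([0.0, 0.0]), n_resp=np.array([5.0, 5.0]),
)
```

Every quantity is a closed-form function of the fitted statistics, so no matrix inversion or retraining is needed to evaluate the variance at a new input.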

3.1 ACTIVE LEARNING WITH A 
MIXTURE OF GAUSSIANS 

We want to select the new input $\tilde{x}$ to minimize
$\langle \tilde{\sigma}^2_{\hat{y}} \rangle$. With a mixture of
Gaussians, the model's estimated distribution of $\tilde{y}$ given
$\tilde{x}$ is explicit:

$$P(\tilde{y}|\tilde{x}) = \sum_{i=1}^N \tilde{h}_i P(\tilde{y}|\tilde{x}, i),$$

where $\tilde{h}_i = h_i(\tilde{x})$. Given this, calculation of
$\langle \tilde{\sigma}^2_{\hat{y}} \rangle$ is straightforward:
we model the change in each $g_i$ separately, calculating its
expected variance given a new point sampled from
$P(\tilde{y}|\tilde{x}, i)$ and weight this change by
$\tilde{h}_i$. The new expectations combine to form the learner's
new expected variance $\langle \tilde{\sigma}^2_{\hat{y}} \rangle$,
where the expectation can be computed exactly in closed form.
4 LOCALLY WEIGHTED
REGRESSION 

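We consider here two forms of locally weighted regression
(LWR): kernel regression and the LOESS model [Cleveland
et al., 1988]. Kernel regression computes $\hat{y}$ as an
average of the $y_i$ in the data set, weighted by a kernel
centered at $x$. The LOESS model performs a linear regression
on points in the data set, weighted by a kernel
centered at $x$. The kernel shape is a design parameter:
the original LOESS model uses a "tricubic" kernel; in
our experiments we use the more common Gaussian

$$h_i(x) \equiv h(x - x_i) = \exp(-k (x - x_i)^2),$$

where $k$ is a smoothing constant. For brevity, we will
drop the argument $x$ for $h_i(x)$, and define $n = \sum_i h_i$.
We can then write the estimated means and covariances
as:

$$\mu_x = \frac{\sum_i h_i x_i}{n}, \qquad \sigma^2_x = \frac{\sum_i h_i (x_i - \mu_x)^2}{n}, \qquad \sigma_{xy} = \frac{\sum_i h_i (x_i - \mu_x)(y_i - \mu_y)}{n},$$

$$\mu_y = \frac{\sum_i h_i y_i}{n}, \qquad \sigma^2_y = \frac{\sum_i h_i (y_i - \mu_y)^2}{n}, \qquad \sigma^2_{y|x} = \sigma^2_y - \frac{\sigma^2_{xy}}{\sigma^2_x}.$$

We use them to express the conditional expectations and
their estimated variances:

$$\text{kernel:} \quad \hat{y} = \mu_y, \qquad \sigma^2_{\hat{y}} = \frac{\sigma^2_y}{n}$$

$$\text{LOESS:} \quad \hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma^2_x} (x - \mu_x), \qquad \sigma^2_{\hat{y}} = \frac{\sigma^2_{y|x}}{n} \left( 1 + \frac{(x - \mu_x)^2}{\sigma^2_x} \right)$$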
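
Both predictors reduce to a handful of kernel-weighted sums, as the following sketch illustrates. The function name `lwr_predict`, the default smoothing constant, and the sample data are hypothetical. The variance expressions used (weighted output variance over $n$ for kernel regression, and conditional variance over $n$ with a distance-dependent inflation for LOESS) are stated here as assumptions of the sketch.

```python
import numpy as np

def lwr_predict(x, x_train, y_train, k=10.0, model="loess"):
    """Prediction and estimated variance at scalar x with a Gaussian kernel."""
    h = np.exp(-k * (x - x_train) ** 2)      # kernel weights h_i
    n = h.sum()                              # effective number of points
    mu_x = h @ x_train / n
    mu_y = h @ y_train / n
    var_x = h @ (x_train - mu_x) ** 2 / n
    var_y = h @ (y_train - mu_y) ** 2 / n
    cov_xy = h @ ((x_train - mu_x) * (y_train - mu_y)) / n
    if model == "kernel":
        # Kernel regression: weighted average of the outputs
        return mu_y, var_y / n
    # LOESS: weighted linear regression centered at x
    var_y_given_x = var_y - cov_xy ** 2 / var_x
    y_hat = mu_y + cov_xy / var_x * (x - mu_x)
    var_hat = var_y_given_x / n * (1.0 + (x - mu_x) ** 2 / var_x)
    return y_hat, var_hat

# Sample data (an assumption): exactly linear and exactly constant targets.
x_tr = np.linspace(0.0, 1.0, 11)
y_loess, v_loess = lwr_predict(0.3, x_tr, 2.0 * x_tr + 1.0, model="loess")
y_kern, v_kern = lwr_predict(0.3, x_tr, np.full(11, 5.0), model="kernel")
```

On exactly linear data the LOESS fit recovers the line with essentially zero estimated variance, and on constant data kernel regression recovers the constant. Nothing is stored besides the training set, so adding a new example requires no retraining.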
4.1 ACTIVE LEARNING WITH LOCALLY
WEIGHTED REGRESSION

Again we want to select the new input $\tilde{x}$ to minimize
$\langle \tilde{\sigma}^2_{\hat{y}} \rangle$. With LWR, the
model's estimated distribution of $\tilde{y}$ given $\tilde{x}$
is explicit:

$$P(\tilde{y}|\tilde{x}) = N(\hat{y}(\tilde{x}), \sigma^2_{y|x}(\tilde{x})).$$

The estimate of $\langle \tilde{\sigma}^2_{\hat{y}} \rangle$ is
also explicit. Defining $\tilde{h}$ as the weight assigned to
$\tilde{x}$ by the kernel, the learner's expected new variance
$\langle \tilde{\sigma}^2_{\hat{y}} \rangle$ can be computed
exactly in closed form.
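
The selection procedure can be illustrated without the closed-form expectation. The sketch below deliberately swaps in a brute-force Monte Carlo estimate: for each candidate query it samples hypothetical outputs from the model's own Gaussian estimate at that input, refits LOESS with each sampled point added, and averages the resulting variance over a set of reference points. All function names, the smoothing constant, and the sine-wave data are assumptions for illustration, and this is slower than the closed-form criterion the paper uses.

```python
import numpy as np

def loess_stats(x_query, x_train, y_train, k):
    """LOESS prediction, prediction variance, and var(y|x) at each query point."""
    h = np.exp(-k * (x_query[:, None] - x_train[None, :]) ** 2)  # h[q, i]
    n = h.sum(axis=1)
    mu_x = h @ x_train / n
    mu_y = h @ y_train / n
    dx = x_train[None, :] - mu_x[:, None]
    dy = y_train[None, :] - mu_y[:, None]
    var_x = (h * dx ** 2).sum(axis=1) / n
    var_y = (h * dy ** 2).sum(axis=1) / n
    cov_xy = (h * dx * dy).sum(axis=1) / n
    var_y_given_x = var_y - cov_xy ** 2 / var_x
    y_hat = mu_y + cov_xy / var_x * (x_query - mu_x)
    var_hat = var_y_given_x / n * (1.0 + (x_query - mu_x) ** 2 / var_x)
    return y_hat, var_hat, var_y_given_x

def expected_variance(x_cand, x_train, y_train, x_ref, k, n_samples, rng):
    """Monte Carlo estimate of the average variance after querying x_cand."""
    y_hat, _, s2 = loess_stats(np.array([x_cand]), x_train, y_train, k)
    # Sample plausible outputs from the model's estimate of P(y | x_cand)
    y_samples = rng.normal(y_hat[0], np.sqrt(max(s2[0], 0.0)), size=n_samples)
    x_aug = np.append(x_train, x_cand)
    totals = []
    for y_new in y_samples:
        _, var_new, _ = loess_stats(x_ref, x_aug, np.append(y_train, y_new), k)
        totals.append(var_new.mean())        # average variance over reference points
    return float(np.mean(totals))

def select_query(candidates, x_train, y_train, x_ref, k=20.0, n_samples=20, seed=0):
    """Return the candidate with the smallest estimated expected variance."""
    rng = np.random.default_rng(seed)
    scores = [expected_variance(c, x_train, y_train, x_ref, k, n_samples, rng)
              for c in candidates]
    return candidates[int(np.argmin(scores))], scores

# Illustrative data (an assumption): a noisy sine wave on [0, 1].
data_rng = np.random.default_rng(1)
x_train = data_rng.uniform(0.0, 1.0, size=30)
y_train = np.sin(2.0 * np.pi * x_train) + 0.1 * data_rng.normal(size=30)
candidates = np.linspace(0.0, 1.0, 11)
x_ref = np.linspace(0.0, 1.0, 50)
best, scores = select_query(candidates, x_train, y_train, x_ref)
```

Replacing the inner sampling loop with the closed-form expectation removes both the refitting cost and the sampling noise, which is the efficiency advantage claimed for locally weighted regression.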
5 EXPERIMENTAL RESULTS

Below we describe two sets of experiments demonstrat- 
ing the predictive power of the query selection criteria in 
this paper. In the first set, learners were trained on data 
from a noisy sine wave. The criteria described in this pa- 
per were applied to predict how a new training example 
selected at point x would decrease the learner's variance. 
These predictions, along with the actual changes in vari- 
ance when the training points were queried and added, 
are plotted in Figures 1, 2, 3, and 4. 

In the second set of experiments, we applied the tech- 
niques of this paper to learning the kinematics of a two- 
joint planar arm (Figure 5; see Cohn [1994] for details). 
Below, we illustrate the problem using the LOESS algo- 
rithm. 

An example of the correlation between predicted and 
actual changes in variance on this problem is plotted in 
Figure 6. Figures 7 and 8 demonstrate that this cor- 
relation may be exploited to guide sequential query se- 
lection. We compared a LOESS learner which selected 
each new query so as to minimize expected variance with 
LOESS learners which selected queries according to var- 
ious heuristics. The variance-minimizing learner signifi- 
cantly outperforms the heuristics in terms of both vari- 
ance and MSE. 

6 SUMMARY 

Mixtures of Gaussians and locally weighted regression
are two statistical models that offer elegant representations
and efficient learning algorithms. In this paper we
have shown that they also offer the opportunity to perform
active learning in an efficient and statistically correct
manner. The criteria derived here can be computed
cheaply and, for problems tested, demonstrate good
predictive power.

Figure 1: The upper portion of the plot indicates the
neural network's fit to noisy sinusoidal data. The
lower portion of the plot indicates predicted and actual
changes in the network's average estimated variance
when x is queried and added to the training set, for
$x \in [0, 1]$. Changes are not plotted to scale with fits.

Figure 2: Fit to data and correlation for a mixture of
Gaussians.

References 

W. Cleveland, S. Devlin, and E. Grosse. (1988) 
Regression by local fitting. Journal of Econometrics 
37:87-114. 

D. Cohn, L. Atlas and R. Ladner. (1990) Train- 
ing Connectionist Networks with Queries and Selective 
Sampling. In D. Touretzky, ed., Advances in Neural In- 
formation Processing Systems 2, Morgan Kaufmann. 

D. Cohn. (1994) Neural network exploration using 
optimal experiment design. In J. Cowan et al., eds., 
Advances in Neural Information Processing Systems 6. 
Morgan Kaufmann. 



Figure 3: Fit to data and correlation for kernel regression.

Figure 4: Fit to data and correlation for the LOESS model.



A. Dempster, N. Laird and D. Rubin. (1977) Max- 
imum likelihood from incomplete data via the EM algo- 
rithm. J. Royal Statistical Society Series B, 39:1-38. 

V. Fedorov. (1972) Theory of Optimal Experiments. 
Academic Press, New York. 

Z. Ghahramani and M. Jordan. (1994) Supervised 
learning from incomplete data via an EM approach. In 
J. Cowan et al., eds., Advances in Neural Information 
Processing Systems 6. Morgan Kaufmann. 

A. Linden and F. Weber. (1993) Implementing in- 
ner drive by competence reflection. In H. Roitblat et 
al., eds., Proc. 2nd Int. Conf. on Simulation of Adaptive 
Behavior, MIT Press, Cambridge. 

D. MacKay. (1992) Information-based objective func- 
tions for active data selection, Neural Computation 4(4): 
590-604. 

S. Nowlan. (1991) Soft Competitive Adaptation: Neu- 
ral Network Learning Algorithms based on Fitting Sta- 
tistical Mixtures. CMU-CS-91-126, School of Computer 
Science, Carnegie Mellon University, Pittsburgh, PA. 

G. Paass and J. Kindermann. (1995) Bayesian query
construction for neural network models. In this volume.




Figure 5: The arm kinematics problem. 
Figure 6: Predicted vs. actual changes in model variance
for LOESS on the arm kinematics problem. 100
candidate points are shown for a model trained with 50
initial random examples. Note that most of the potential
queries produce very little improvement, and that
the algorithm successfully identifies those few that will
help most.

Figure 7: Variance for a LOESS learner selecting queries
according to the variance-minimizing criterion discussed
in this paper and according to several heuristics.
"Sensitivity" queries where output is most sensitive to new
data, "Bias" queries according to a bias-minimizing
criterion, and "Support" queries where the model has the
least data support. The variances of "Random" and
"Sensitivity" are off the scale. Curves are medians over
15 runs with non-Gaussian noise.

Figure 8: MSE for a LOESS learner selecting queries
according to the variance-minimizing criterion discussed
in this paper and according to the heuristics described
in the previous figure.



